[gluon][pa_mqa_logits] memory-safety: mask all OutLogits buffer_store lanes (does NOT fix long-context accuracy)#2936
Draft
maeehart wants to merge 3 commits into
…le+varctx) Guard every gl.amd.cdna3.buffer_store targeting OutLogits_buffer with col < max_model_len, AND-ing with existing split_context_start masks where present. Prevents SIMD lanes from writing past allocated logits length on long-context / SplitKV tile boundaries. See ROCm/aiter community validation: numerical bounds worked example in PR. Made-with: Cursor
Pull request overview
This PR fixes potential out-of-bounds writes in the Gluon pa_mqa_logits preshuffle kernels by ensuring every gl.amd.cdna3.buffer_store into OutLogits_buffer is predicated so SIMD lanes with logical column index col >= max_model_len do not execute the store. This targets the _preshuffle and _preshuffle_varctx kernel variants implicated in HIP memory-access faults on long-context paths.
Changes:
- Add `col < max_model_len` upper-bound predicates to multiple `OutLogits_buffer` `buffer_store` sites in `_gluon_deepgemm_fp8_paged_mqa_logits_preshuffle`.
- Add the same upper-bound store masking in `_gluon_deepgemm_fp8_paged_mqa_logits_preshuffle_varctx`.
- Where a lower-bound predicate already existed (e.g. `>= split_context_start`), combine it with the new upper bound via `&`.
Comment on lines 594 to +598
    + (
        context_idx
        + gl.arange(0, ChunkKPerStage, layout=gl.SliceLayout(0, mfma_layout))
    ),
    mask=context_idx
    + gl.arange(0, ChunkKPerStage, layout=gl.SliceLayout(0, mfma_layout))
    >= split_context_start,
    mask=(
…tores Align _gluon_deepgemm_fp8_paged_mqa_logits buffer_store with logits allocation: AND (col < max_model_len) onto existing >= 0 predicates (col is index into OutLogits rows). Same correctness class as preshuffle path; addresses review asking to cover non-preshuffle variant. Made-with: Cursor
Summary (memory safety only — see "Scope" below)
Several `gl.amd.cdna3.buffer_store(ptr=OutLogits_buffer, …)` sites in the Gluon `pa_mqa_logits` kernels write `float32` logits at logical column indices `col` without the upper-bound check `col < max_model_len`. Existing predicates only cover the lower bound (e.g. `col >= 0` or `col >= split_context_start`). When `context_length == max_model_len` and `split_context_length` is rounded up to the next `KVBlockSize`, the inner-loop tail can issue stores up to `(KVBlockSize − 1) + (ChunkKPerStage − 1)` columns past the end of the `OutLogits_buffer` allocation. That is the textbook unmasked-`buffer_store` overshoot pattern.
This patch combines `(col < max_model_len)` with the existing predicate via `&` at every `OutLogits_buffer` store site in:
- `_gluon_deepgemm_fp8_paged_mqa_logits`
- `_gluon_deepgemm_fp8_paged_mqa_logits_preshuffle`
- `_gluon_deepgemm_fp8_paged_mqa_logits_preshuffle_varctx`
So lanes with `col >= max_model_len` are predicated off and no longer issue a `buffer_store`. Nothing else in the kernel body (KV/scale loads, MFMA, `gl.reduce`, the `tl.where` that fills `o`) is touched, so for any in-bounds column the value of `OutLogits_buffer[batch, col]` is bit-identical before and after this PR.
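The predicate combination described above can be sketched in plain Python. This is an illustrative host-side emulation, not the Gluon kernel code: `store_mask` and `masked_store` are hypothetical helpers, the tile sizes are made up, and a Python list stands in for one row of `OutLogits_buffer`.

```python
def store_mask(cols, max_model_len, lower_bound=0):
    """Per-lane predicate: existing lower bound AND the new upper bound."""
    return [(c >= lower_bound) and (c < max_model_len) for c in cols]


def masked_store(buffer, cols, values, mask):
    """Emulate a predicated store: lanes whose mask is False write nothing."""
    for c, v, m in zip(cols, values, mask):
        if m:
            buffer[c] = v


# Hypothetical sizes chosen only to keep the example small.
max_model_len = 8
chunk = 4
context_idx = 6                      # this tile covers columns 6..9
cols = [context_idx + lane for lane in range(chunk)]
values = [1.0, 2.0, 3.0, 4.0]

out_logits = [0.0] * max_model_len   # one row of the logits allocation
mask = store_mask(cols, max_model_len, lower_bound=0)
masked_store(out_logits, cols, values, mask)
```

Lanes for columns 8 and 9 are switched off, so the two in-bounds columns receive exactly the values they would have received without the extra predicate, and nothing is written past the end of the row.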
Scope (read this first)
- This PR removes undefined-behaviour writes that can produce HIP "Memory access fault by GPU" (MAF) failures or silently corrupt neighbouring memory.
- This PR does not change the values written to in-bounds columns. By construction it cannot: the upstream computation path is untouched.
- This PR does not fix the long-context numerical regression in `_gluon_deepgemm_fp8_paged_mqa_logits_preshuffle` on `gfx950` (top-k mismatch vs. `deepgemm_fp8_paged_mqa_logits_stage1` once `context_len` exceeds ~2048; see vLLM#39303). That accuracy bug is a separate kernel-integration issue (SplitKV tiling, KV-block addressing in the `LoadBlockIndiceForEachStage` branch, MFMA operand orientation) and will be addressed in a follow-up.
Worked bounds example
With `max_model_len = 100`, `ChunkKPerStage = 32`, `context_idx = 75`: lane columns `75 … 106` were emitted. Only `75 … 99` are valid; the last 7 lanes were performing OOB stores. With this PR they are predicated off.
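The lane arithmetic in this example can be checked with a few lines of Python. The helper `oob_lanes` is a hypothetical name introduced here for illustration; it only counts columns and assumes each stage emits `ChunkKPerStage` consecutive columns starting at `context_idx`.

```python
def oob_lanes(context_idx, chunk_k_per_stage, max_model_len):
    """Return (last emitted column, number of lanes at or past max_model_len)."""
    last_col = context_idx + chunk_k_per_stage - 1
    first_oob = max(context_idx, max_model_len)
    n_oob = max(0, last_col - first_oob + 1)
    return last_col, n_oob


last_col, n_oob = oob_lanes(context_idx=75, chunk_k_per_stage=32, max_model_len=100)
print(last_col, n_oob)  # 106 7
```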
With `ChunkK = 256`, `ChunkKPerStage = 128`, `max_model_len = 5120`, `context_idx = 5040`: indices up to `5167` are emitted; 48 tail lanes past `5119` were performing OOB stores and are now predicated off.
Validation
… `pa_mqa_logits.py` … `main`, not caused by this PR; see CI artifacts)
… `_cached_paged_logits` by +256 cols)
… `n_spec=1, c=4` on MI355X
The vLLM-side variant is the operational equivalent of this PR (it lets the kernel's overshoot land in safe row padding instead of relying on the kernel to predicate). Both are valid mitigations for the same memory-safety bug; the kernel-side fix in this PR removes the dependency on caller-side padding and is the right layer for the fix.
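To illustrate why caller-side padding and kernel-side predication address the same overshoot, here is a small host-side emulation. It assumes a row-major buffer with a fixed row stride, which is a simplification; `unmasked_tile_store` is a hypothetical helper, and the numbers reuse the first worked example together with the +256-column padding mentioned above.

```python
def unmasked_tile_store(flat, row, stride, first_col, n_lanes, value):
    """Emulate an unpredicated tile store into a row-major 2-D buffer."""
    for lane in range(n_lanes):
        flat[row * stride + first_col + lane] = value


max_model_len, n_rows = 100, 2
context_idx, chunk_k_per_stage = 75, 32   # tile covers columns 75..106

# Tight rows (stride == max_model_len): the 7 tail lanes land in row 1.
tight = [0.0] * (n_rows * max_model_len)
unmasked_tile_store(tight, 0, max_model_len, context_idx, chunk_k_per_stage, 1.0)
spilled = tight[max_model_len:max_model_len + 7]

# Padded rows (assumed stride of max_model_len + 256): the same overshoot
# stays inside row 0's padding, and row 1's valid columns are untouched.
pad = 256
stride = max_model_len + pad
padded = [0.0] * (n_rows * stride)
unmasked_tile_store(padded, 0, stride, context_idx, chunk_k_per_stage, 1.0)
row1_valid = padded[stride:stride + max_model_len]
```

In this sketch the padding only helps while it is at least as large as the worst-case overshoot, which is why predicating the store in the kernel does not depend on what the caller allocates.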
Patch reproducibility
Regenerate via the supplied diff or cherry-pick from the branch tip; only `buffer_store` predicates change.
Checklist
- `(col < max_model_len)` `&`-combined with prior predicate on every `OutLogits_buffer` store
- … OOB write is the cause of the MAF-class crash
- … from draft to ready-for-review